INN Hotels Project

Context

A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons for cancellations include changes of plans and scheduling conflicts. Cancelling is often made easier by the option to do so free of charge or at a low cost, which benefits hotel guests but is a less desirable and possibly revenue-diminishing factor for hotels. Such losses are particularly high for last-minute cancellations.

The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

The cancellation of bookings impacts a hotel on various fronts:

Objective

The increasing number of cancellations calls for a Machine Learning based solution that can help predict which bookings are likely to be canceled. INN Hotels Group has a chain of hotels in Portugal. The group is facing a high number of booking cancellations and has reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings are going to be canceled, and help formulate profitable policies for cancellations and refunds.

Data Description

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.

Data Dictionary

Importing necessary libraries and data

Data Overview

Observations

Exploratory Data Analysis (EDA)

Univariate Analysis

Observations for number of adults

Observations for number of children

Observations for number of weekend nights

Observations for number of week nights

Observations for meal plan

Observations for parking space

Observations for room type reserved

Observations for lead time

Observations for arrival year

Observations for arrival month

Observations for arrival date

Observations for market segment

Observations for repeat guest

Observations for number of previous cancellations

Observations for number of previous bookings not canceled

Observations for average price per room

Observations for number of special requests

Observations for booking status

Bivariate Analysis

Observations for correlation heatmap

Looking at relationships with target variable, "booking_status"

Observations for booking_status vs no_of_adults

Observations for booking_status vs no_of_children

Observations for booking_status vs no_of_weekend_nights

Observations for booking_status vs no_of_week_nights

Observations for booking_status vs type_of_meal_plan

Observations for booking_status vs required_car_parking_space

Observations for booking_status vs room_type_reserved

Observations for booking_status vs lead_time

Observations for booking_status vs arrival_year

Observations for booking_status vs arrival_month

Observations for booking_status vs arrival_date

Observations for booking_status vs market_segment_type

Observations for booking_status vs repeated_guest

Observations for booking_status vs no_of_previous_cancellations

Observations for booking_status vs no_of_previous_bookings_not_canceled

Observations for booking_status vs avg_price_per_room

Observations for booking_status vs no_of_special_requests

Overall Observations

Leading Questions:

  1. What are the busiest months in the hotel?
  2. Which market segment do most of the guests come from?
  3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
  4. What percentage of bookings are canceled?
  5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
  6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?

Question 1 - What are the busiest months in the hotel?

Question 2 - Which market segment do most of the guests come from?

Question 3 - Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?

Observations from the plot

Question 4 - What percentage of bookings are canceled?

Question 5 - Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?

Question 6 - Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?

Data Preprocessing

Missing Value Treatment

There are no missing values, so no missing value treatment is needed.

Feature Engineering
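The bivariate analysis later in this notebook refers to two engineered features, no_of_people and no_of_days_stayed. The sketch below shows one plausible way to derive them. The formulas (adults plus children, weekend nights plus week nights) and the toy rows are assumptions for illustration, not taken from the original notebook.

```python
import pandas as pd

# Toy rows using column names from the data set; the values are made up.
df = pd.DataFrame(
    {
        "no_of_adults": [2, 1, 2],
        "no_of_children": [0, 0, 2],
        "no_of_weekend_nights": [1, 2, 0],
        "no_of_week_nights": [2, 3, 1],
    }
)

# Assumed definitions: total guests and total nights per booking.
df["no_of_people"] = df["no_of_adults"] + df["no_of_children"]
df["no_of_days_stayed"] = df["no_of_weekend_nights"] + df["no_of_week_nights"]

print(df[["no_of_people", "no_of_days_stayed"]])
```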

Outlier Detection and Treatment

Preparing the data for modeling

EDA

Univariate Analysis (for new features)

Observations

Observations

Bivariate Analysis (for new features)

Observations for correlation heatmap

Observations for booking_status vs no_of_people

Observations for booking_status vs no_of_days_stayed

Logistic Regression Modeling

Checking Multicollinearity

Variance Inflation Factor (VIF)
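For each predictor j, the VIF is obtained by regressing that predictor on all the others and computing VIF_j = 1 / (1 - R_j^2). The sketch below computes it directly from this definition with NumPy on synthetic data. The helper name vif_table and the toy columns are hypothetical, and the notebook itself may rely on a library routine instead.

```python
import numpy as np
import pandas as pd


def vif_table(X: pd.DataFrame) -> pd.Series:
    """Compute VIF_j = 1 / (1 - R_j^2) for every column of X."""
    vifs = {}
    for col in X.columns:
        y = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        # Add an intercept so R^2 is measured around the column mean.
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")


# Synthetic predictors: "a_plus_noise" is almost a copy of "a", "b" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
X = pd.DataFrame({"a": a, "b": b, "a_plus_noise": a + 0.1 * rng.normal(size=500)})

vif = vif_table(X)
print(vif.round(2))
```

A common rule of thumb treats VIF values above 5 or 10 as a sign of problematic multicollinearity. In this toy example the two near-duplicate columns exceed that range while the independent column stays close to 1.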

Building a Logistic Regression model

Model performance evaluation

Coefficient interpretations

Some coefficient interpretations

Performance Metrics of the final model - 'lg2'

ROC-AUC Curve on training set

Optimal threshold using AUC-ROC curve
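One common way to pick a threshold from the ROC curve is to maximize the difference between the true positive rate and the false positive rate (Youden's J statistic). The sketch below does this on synthetic data. The data set, the model settings, and the variable names are assumptions for illustration, and the notebook may use a different criterion.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

# Synthetic stand-in for the training data (roughly one third positives).
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.67, 0.33], random_state=1
)
model = LogisticRegression(max_iter=1000).fit(X, y)
proba = model.predict_proba(X)[:, 1]

# roc_curve returns one (fpr, tpr) pair per candidate threshold.
fpr, tpr, thresholds = roc_curve(y, proba)

# Youden's J: the threshold where tpr - fpr is largest.
best_idx = int(np.argmax(tpr - fpr))
optimal_threshold = thresholds[best_idx]
print(f"Optimal threshold (max tpr - fpr): {optimal_threshold:.3f}")
```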

Checking model performance on training set with optimal threshold from AUC-ROC

Precision-Recall Curve

At a threshold of around 0.42, precision and recall are equal. Stepping back to a threshold of around 0.39 gives higher recall while keeping good precision.
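The trade-off described above can also be read off programmatically. The following is a minimal sketch on synthetic data; the data set and model are stand-ins, so the crossover will not land at 0.42, and the 0.03 step is an arbitrary illustrative choice. It finds the threshold where precision and recall are closest, then compares recall at a slightly lower threshold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, recall_score

# Synthetic stand-in for the training data.
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.67, 0.33], random_state=1
)
proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

# precision and recall have one more entry than thresholds; drop the last point.
precision, recall, thresholds = precision_recall_curve(y, proba)
gap = np.abs(precision[:-1] - recall[:-1])
balanced = thresholds[int(np.argmin(gap))]

# Stepping slightly below the balance point trades some precision for recall.
lower = balanced - 0.03
recall_balanced = recall_score(y, (proba >= balanced).astype(int))
recall_lower = recall_score(y, (proba >= lower).astype(int))

print(f"balanced threshold: {balanced:.3f}, recall: {recall_balanced:.3f}")
print(f"lower threshold:    {lower:.3f}, recall: {recall_lower:.3f}")
```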

Checking model performance with optimal threshold curve on training set

Final Model Summary

Checking the performance on the test set

Dropping the columns from the test set that were dropped from the training set

Using model with default threshold

Conclusions

Decision Tree Modeling

Model evaluation criterion

The model can make wrong predictions in two ways:

  1. False Positive: Predicting that a customer will not cancel their booking when in reality the customer cancels, leading to a loss of revenue for INN Hotels in the form of getting the room ready, not being able to resell the room, etc.

  2. False Negative: Predicting that a customer will cancel their booking when in reality the customer does not cancel, leading to a loss of opportunity if INN Hotels decides to try to book the room to someone else at a lower price.

Which case is more important?

How to reduce this loss, i.e., how to reduce False Negatives?
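Recall is the metric that penalizes False Negatives: recall = TP / (TP + FN), so every False Negative removed raises it. A small sketch with made-up labels follows. The arrays are purely illustrative, and which outcome counts as the positive class depends on how booking_status is encoded.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 1 is the positive class, 0 the negative class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall computed by hand matches sklearn's recall_score.
recall = tp / (tp + fn)
print(f"TN={tn} FP={fp} FN={fn} TP={tp} recall={recall:.2f}")
```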

Building a Decision Tree model

Initial Observation

Visualizing the Decision Tree

The visual decision tree and its text report show that the tree is very complex, with many decision nodes and branches, which explains why it performed well on the training data and not as well on the test data.

The top three features for determining whether a booking is canceled are lead_time, avg_price_per_room, and arrival_date.

Using GridSearch for Hyperparameter tuning of our tree model
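A minimal sketch of such a search, scored on recall since that is the metric tracked in the observations that follow. The synthetic data and the parameter grid are assumptions for illustration, not the grid used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared booking data.
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.67, 0.33], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Hypothetical grid: each setting limits how far the tree can grow.
param_grid = {
    "max_depth": [3, 5, 7, None],
    "min_samples_leaf": [1, 5, 10],
    "max_leaf_nodes": [10, 25, 50],
}

# 5-fold cross-validated search that keeps the combination with the best recall.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1), param_grid, scoring="recall", cv=5
)
grid.fit(X_train, y_train)

tuned_tree = grid.best_estimator_
print(grid.best_params_)
```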

Checking performance on training set (Hyperparameter tuned decision tree)

After hyperparameter tuning, recall on the training set has decreased.

Checking performance on test set (Hyperparameter tuned decision tree)

After hyperparameter tuning, recall on the test set has increased.

After hyperparameter tuning, the decision tree looks simpler than the untuned tree, and recall improved on the test data. However, the tree still looks complex. Pruning is the next step, to see whether it further simplifies the tree and further improves recall.

Do we need to prune the tree?

Yes, the tree needs to be pruned because it is still complex after hyperparameter tuning. Cost Complexity Pruning will be used.

Cost Complexity Pruning

The DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned.

Total impurity of leaves vs effective alphas of pruned tree

Minimal cost complexity pruning recursively finds the node with the "weakest link". The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path that returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.
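The behavior described above can be checked with a short sketch on synthetic data; the data set and variable names are assumptions for illustration. It computes the pruning path, then refits one tree per effective alpha to show the node count shrinking as alpha grows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=600, n_features=8, random_state=1)

# Effective alphas and the total leaf impurity at each pruning step.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)
# Guard against tiny negative alphas caused by floating-point round-off.
ccp_alphas = np.maximum(path.ccp_alphas, 0.0)
impurities = path.impurities

# One tree per alpha: larger alpha means more pruning and fewer nodes.
node_counts = [
    DecisionTreeClassifier(random_state=1, ccp_alpha=alpha).fit(X, y).tree_.node_count
    for alpha in ccp_alphas
]

print(f"{len(ccp_alphas)} pruning steps")
print(f"nodes at smallest alpha: {node_counts[0]}, at largest alpha: {node_counts[-1]}")
```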

Recall vs alpha for training and testing sets
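A sketch of how recall can be tracked across the pruning path, again on synthetic data; the data set, split, and variable names are assumptions. One tree is fitted per alpha, and recall is recorded on both sets so the two curves can be compared.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared booking data.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.67, 0.33], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
    X_train, y_train
)
# Drop the last alpha (it prunes the tree down to the root) and clip
# tiny negative values caused by floating-point round-off.
ccp_alphas = np.maximum(path.ccp_alphas[:-1], 0.0)

recall_train, recall_test = [], []
for alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha)
    clf.fit(X_train, y_train)
    recall_train.append(recall_score(y_train, clf.predict(X_train)))
    recall_test.append(recall_score(y_test, clf.predict(X_test)))

# Alpha with the highest recall on the held-out data.
best_alpha = ccp_alphas[int(np.argmax(recall_test))]
print(f"best alpha by held-out recall: {best_alpha:.5f}")
```

Choosing alpha on a separate validation split or by cross-validation is a safer basis than the test set, since the test set then stays untouched for the final evaluation.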

Observations

Model Performance Comparison and Conclusions

Observations between the Logistic Regression and Decision Tree models

Actionable Insights and Recommendations

Profitable policies for cancellations and refunds INN Hotels can adopt

Recommendations for INN Hotels